Evaluation of Syntactic Phrase Indexing -- CLARIT NLP Track Report

Authors

  • Xiang Tong
  • ChengXiang Zhai
  • Natasa Milic-Frayling
  • David A. Evans
Abstract

The CLARIT NLP track effort is focused on evaluating the usefulness of syntactic phrases for document indexing. The CLARIT system has several NLP techniques integrated with the vector space retrieval model [Evans et al. 91, Evans et al. 95]. The NLP techniques used in CLARIT include morphological analysis, robust noun-phrase parsing, and automatic construction of first-order thesauri, among others. One main feature of CLARIT indexing is that it emphasizes phrase-based indexing with different options for decomposing noun phrases into smaller constituents, including single words. In past TRECs, the default mode for indexing involved full noun phrases, single words, and occasionally selected sub-phrases. While some early experiments have shown the effectiveness of noun phrases for indexing [Evans et al. 91], there is no direct evaluation of their effectiveness in the context of TREC. In particular, the contribution of small sub-phrases to retrieval performance has not been evaluated outside the context of overall system performance. The version of the CLARIT system that we used in the experiments has its NLP component tightly integrated with the rest of the system. This does not allow easy evaluation of the individual NLP components. We, therefore, developed separate NLP modules and used CLARIT as a retrieval engine only to evaluate the effectiveness of phrase-based indexing.
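
The decomposition options mentioned in the abstract (full noun phrases, sub-phrases, single words) can be illustrated with a minimal sketch. The abstract does not say how CLARIT selects sub-phrases, so this example assumes they are simply the contiguous word sequences inside a phrase; the function name `decompose_noun_phrase` and its parameters are hypothetical and are not part of CLARIT.

```python
def decompose_noun_phrase(phrase, include_subphrases=True, include_words=True):
    """Return candidate index terms for one noun phrase.

    Assumption (not from the paper): a sub-phrase is any contiguous
    sequence of two or more words that is shorter than the full phrase.
    """
    words = phrase.lower().split()
    if not words:
        return []
    n = len(words)
    terms = [" ".join(words)]  # the full noun phrase is always kept
    if include_subphrases:
        # Contiguous sequences, longest first, down to two words.
        for length in range(n - 1, 1, -1):
            for start in range(n - length + 1):
                terms.append(" ".join(words[start:start + length]))
    if include_words and n > 1:
        terms.extend(words)  # single-word constituents
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(terms))


if __name__ == "__main__":
    for term in decompose_noun_phrase("natural language processing"):
        print(term)
```

Switching `include_subphrases` or `include_words` off gives the kind of indexing variants one would compare when measuring what each constituent type contributes to retrieval.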

Similar resources

Experiments in Query Optimization The CLARIT System TREC-6 Report

In general, all CLARIT processing for TREC-6 tasks (except Chinese) took advantage of standard CLARIT indexing, which involves a natural-language processing of source texts to identify and normalize noun phrases, sub-phrases, and individual words. In addition, most processing involved one or more methods for the identification of terms to supplement a query or information profile, including the...

Fast Statistical Parsing of Noun Phrases for Document Indexing

Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are ...

Comparing the Effect of Syntactic vs. Statistical Phrase Indexing Strategies for

In this paper we describe the results of experiments contrasting syntactic phrase indexing with statistical phrase indexing for Dutch texts. Our results showed that we at least need a compound splitting algorithm for good quality retrieval for Dutch texts. If we then add either syntactic or statistical phrases, performance generally improves, but this effect is never statistically significant. If...

Design and Evaluation of the CLARIT-TREC-2 System

All of the results we report in this paper follow from straightforward applications of base-level CLARIT processing, utilizing essentially the same CLARIT components that were employed in the CLARIT–TREC1 system. The general improvements we observe in CLARIT–TREC-2 processing are attributable to modifications (especially simplifications) in processing steps and in the settings of system variables...

Publication date: 1996